nonsmooth function



Provably Correct Automatic Sub-Differentiation for Qualified Programs

Sham M. Kakade, Jason D. Lee

Neural Information Processing Systems

The Cheap Gradient Principle [Griewank and Walther, 2008] -- the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of 5) as that of simply computing the function itself -- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black-box gradient-based optimization procedures. The current state of affairs is markedly different with regard to computing subderivatives: widely used ML libraries, including TensorFlow and PyTorch, do not correctly compute (generalized) subderivatives even on simple examples. This work considers the question: is there a Cheap Subgradient Principle? Our main result shows that, under certain restrictions on our library of nonsmooth functions (standard in nonlinear programming), provably correct generalized subderivatives can be computed at a computational cost that is within a (dimension-free) factor of 6 of the cost of computing the scalar function itself.
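The failure the abstract alludes to is easy to reproduce. Below is a minimal sketch (plain-Python dual numbers, not the paper's algorithm): with the relu'(0) = 0 convention used by TensorFlow and PyTorch, forward-mode automatic differentiation reports derivative 0 for f(x) = relu(x) - relu(-x) at x = 0, even though f is the identity function and the true derivative is 1 everywhere.

```python
# Minimal forward-mode AD with dual numbers; relu'(0) := 0,
# the convention used by TensorFlow and PyTorch.
class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __neg__(self):
        return Dual(-self.val, -self.dot)

def relu(x):
    # derivative taken as 0 at the kink x == 0
    return Dual(max(x.val, 0.0), x.dot if x.val > 0 else 0.0)

def f(x):
    # f(x) = relu(x) - relu(-x) is the identity, so f'(x) = 1 everywhere
    return relu(x) - relu(-x)

y = f(Dual(0.0, 1.0))
print(y.val, y.dot)   # AD reports derivative 0.0 at x = 0, not the true value 1
```

Away from the kink (e.g. at x = 1 or x = -1) the same code returns the correct derivative 1; only at the nondifferentiable point does the per-operation convention compose into a wrong answer.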


Better Approximation and Faster Algorithm Using the Proximal Average

Neural Information Processing Systems

It is common practice to approximate "complicated" functions with friendlier ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation that simply pretends the proximal map is linear. The new approximation is justified using a recent convex-analysis tool -- the proximal average -- and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims.
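The "pretended linearity" can be made concrete with a toy example (illustrative functions and names, not the paper's experiments): for f(x) = |x - a| the proximal map has a closed form, "soft threshold toward a", and the idea is to replace the generally intractable prox of the averaged function (1/2)(f1 + f2) by the average of the two individual proxes, which the proximal average justifies.

```python
def prox_abs(z, a, lam):
    # prox_{lam * |x - a|}(z): shrink z toward a by at most lam
    if z > a + lam:
        return z - lam
    if z < a - lam:
        return z + lam
    return a

def prox_average(z, lam):
    # average of the individual proximal maps for f1(x) = |x|, f2(x) = |x - 1|,
    # used in place of the prox of (1/2)(f1 + f2)
    return 0.5 * (prox_abs(z, 0.0, lam) + prox_abs(z, 1.0, lam))

print(prox_average(2.0, 0.5))   # prints 1.5
```

A proximal gradient algorithm then alternates a gradient step on the smooth part of the objective with this averaged prox step.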


Nonsmooth Nonparametric Regression via Fractional Laplacian Eigenmaps

Shi, Zhaoyang, Balasubramanian, Krishnakumar, Polonik, Wolfgang

arXiv.org Machine Learning

Laplacian-based nonparametric regression is a widely used approach in machine learning that leverages the Laplacian Eigenmaps algorithm to perform regression tasks without relying on explicit parametric models. The nonparametric nature of the approach makes it flexible and adaptable to the data-generating process without imposing strict assumptions about the functional form of the relationship between the response and the covariates. Existing theoretical studies of this approach are restricted to establishing minimax rates of convergence and adaptivity properties when the true regression function lies in Sobolev spaces; see Section 1.1 for details. Such spaces are inherently smooth in nature and exclude important function classes in nonparametric statistics, such as piecewise constant or polynomial functions, bump functions and other such nonsmooth function classes. In this work, using the framework of fractional Laplacians, we propose a novel approach called Principal Component Regression using Fractional Laplacian Eigenmaps (PCR-FLE) for nonsmooth and nonparametric regression. The PCR-FLE algorithm generalizes the PCR-LE algorithm by Green et al. (2023) and the PCR-WLE algorithm by Shi et al. (2024), and is designed to naturally handle the case when the true regression function lies in an L
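A minimal sketch of the non-fractional special case (in the spirit of PCR-LE; graph construction, sample size, and the number K of eigenvectors are illustrative choices, and the paper's method would replace the Laplacian by a fractional power of it): build a neighborhood graph on the covariates, take the eigenvectors of the graph Laplacian with smallest eigenvalues, and regress the response on them by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sign(x - 0.5) + 0.1 * rng.normal(size=n)   # nonsmooth (step) regression function

# epsilon-neighborhood graph and unnormalized Laplacian L = D - W
eps = 0.05
W = (np.abs(x[:, None] - x[None, :]) < eps).astype(float)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# eigenvectors of L with the smallest eigenvalues span the "smooth" directions
vals, vecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
K = 10
U = vecs[:, :K]

beta = U.T @ y        # columns of U are orthonormal, so OLS is a projection
y_hat = U @ beta
print(float(np.mean((y_hat - y) ** 2)))
```

Because the fit is a projection of y onto a K-dimensional subspace, the in-sample error never exceeds that of the zero predictor; the choice of K trades bias against variance.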


Learning in Feedforward Networks with Nonsmooth Functions

Neural Information Processing Systems

This paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth. Examples of such networks are those whose neural transfer functions are piecewise-linear and those whose error function is defined in terms of the ℓ∞ norm. Up to now, networks whose neural transfer functions are piecewise-linear have received very little consideration in the literature, but the possibility of using an error function defined in terms of the ℓ∞ norm has received some attention. In this latter work, however, the problems that can occur when gradient methods are used for nonsmooth error functions have not been addressed. In this paper we draw upon some recent results from the field of nonsmooth optimization (NSO) to present an algorithm for the nonsmooth case.
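For intuition about why the ℓ∞ error is a natural nonsmooth target, note that it admits cheap subgradients: for E(w) = max_i |<w, x_i> - y_i|, the vector sign(r_j) x_j is a valid subgradient for any index j attaining the maximum residual. The sketch below (illustrative linear-model setup with a diminishing step size, not the paper's NSO algorithm) runs plain subgradient descent with this choice and tracks the best iterate, since subgradient methods are not monotone.

```python
import numpy as np

def linf_subgradient(w, X, y):
    # E(w) = max_i |<w, x_i> - y_i|; sign(r_j) * x_j is a subgradient
    # for any index j attaining the maximum absolute residual
    r = X @ w - y
    j = int(np.argmax(np.abs(r)))
    return np.sign(r[j]) * X[j]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])     # noiseless linear data

w = np.zeros(3)
best_err = float(np.max(np.abs(X @ w - y)))
for t in range(1, 2001):
    w = w - 0.1 / np.sqrt(t) * linf_subgradient(w, X, y)
    best_err = min(best_err, float(np.max(np.abs(X @ w - y))))
print(best_err)   # far below the initial error max|y|
```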


Provably Correct Automatic Subdifferentiation for Qualified Programs

Kakade, Sham, Lee, Jason D.

arXiv.org Machine Learning

The Cheap Gradient Principle (Griewank 2008) --- the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of $5$) as that of simply computing the function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black-box gradient-based optimization procedures. The current state of affairs is markedly different with regard to computing subderivatives: widely used ML libraries, including TensorFlow and PyTorch, do not correctly compute (generalized) subderivatives even on simple examples. This work considers the question: is there a Cheap Subgradient Principle? Our main result shows that, under certain restrictions on our library of nonsmooth functions (standard in nonlinear programming), provably correct generalized subderivatives can be computed at a computational cost that is within a (dimension-free) factor of $6$ of the cost of computing the scalar function itself.


Stochastic Smoothing for Nonsmooth Minimizations: Accelerating SGD by Exploiting Structure

Ouyang, Hua, Gray, Alexander

arXiv.org Machine Learning

In this work we consider the stochastic minimization of nonsmooth convex loss functions, a central problem in machine learning. We propose a novel algorithm called Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), which exploits the structure of common nonsmooth loss functions to achieve optimal convergence rates for a class of problems including SVMs. It is the first stochastic algorithm that can achieve the optimal O(1/t) rate for minimizing nonsmooth loss functions (with strong convexity). The fast rates are confirmed by empirical comparisons, in which ANSGD significantly outperforms previous subgradient descent algorithms including SGD.
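The smoothing baseline that ANSGD is compared against can be sketched for the hinge loss (illustrative code, not the paper's algorithm): the Moreau envelope of max(0, 1 - m) with parameter mu is the familiar Huber-style hinge, which is differentiable everywhere and stays uniformly within mu/2 of the original, so plain SGD can be run on the surrogate.

```python
def hinge(m):
    return max(0.0, 1.0 - m)

def smoothed_hinge(m, mu):
    # Moreau envelope of the hinge: quadratic near the kink at m = 1
    if m >= 1.0:
        return 0.0
    if m <= 1.0 - mu:
        return 1.0 - m - mu / 2.0
    return (1.0 - m) ** 2 / (2.0 * mu)

def smoothed_hinge_grad(m, mu):
    if m >= 1.0:
        return 0.0
    if m <= 1.0 - mu:
        return -1.0
    return -(1.0 - m) / mu

# uniform approximation gap on a grid of margins: at most mu / 2
gap = max(abs(hinge(m) - smoothed_hinge(m, 0.1))
          for m in [i / 100 for i in range(-100, 200)])
print(gap)   # approximately mu / 2 = 0.05
```

Smaller mu means a tighter approximation but a larger Lipschitz constant of the gradient (1/mu), which is the trade-off structured methods such as ANSGD aim to sidestep.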